Introduction

This IPython notebook illustrates how to select the best learning-based matcher. First, we need to import the py_entitymatching package and other libraries as follows:


In [1]:
# Import py_entitymatching package
import py_entitymatching as em
import os
import pandas as pd

# Set the seed value 
seed = 0

In [3]:
# Get the datasets directory
datasets_dir = em.get_install_path() + os.sep + 'datasets'

path_A = datasets_dir + os.sep + 'dblp_demo.csv'
path_B = datasets_dir + os.sep + 'acm_demo.csv'
path_labeled_data = datasets_dir + os.sep + 'labeled_data_demo.csv'

In [5]:
A = em.read_csv_metadata(path_A, key='id')
B = em.read_csv_metadata(path_B, key='id')
# Load the pre-labeled data
S = em.read_csv_metadata(path_labeled_data, 
                         key='_id',
                         ltable=A, rtable=B, 
                         fk_ltable='ltable_id', fk_rtable='rtable_id')



Then, we split the labeled data into a development set and an evaluation set. We use the development set to select the best learning-based matcher.


In [6]:
# Split S into I and J
IJ = em.split_train_test(S, train_proportion=0.5, random_state=0)
I = IJ['train']
J = IJ['test']

Selecting the best learning-based matcher

This typically involves the following steps:

  1. Creating a set of learning-based matchers
  2. Creating features
  3. Extracting feature vectors
  4. Selecting the best learning-based matcher using k-fold cross validation
  5. Debugging the matcher (and possibly repeating the above steps)

Creating a set of learning-based matchers

First, we need to create a set of learning-based matchers. The following matchers are supported in Magellan: (1) decision tree, (2) random forest, (3) naive Bayes, (4) SVM, (5) logistic regression, and (6) linear regression. We create five of these below; a naive Bayes matcher can be created the same way (see the note after the code).


In [7]:
# Create a set of ML-matchers
dt = em.DTMatcher(name='DecisionTree', random_state=0)
svm = em.SVMMatcher(name='SVM', random_state=0)
rf = em.RFMatcher(name='RF', random_state=0)
lg = em.LogRegMatcher(name='LogReg', random_state=0)
ln = em.LinRegMatcher(name='LinReg')
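
The matcher list above also mentions naive Bayes, which we do not use in the remainder of this guide. A minimal sketch of creating one, assuming em.NBMatcher follows the same interface as the matchers above:

# Create a naive Bayes matcher (sketch; not used below)
nb = em.NBMatcher(name='NaiveBayes')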

Creating features

Next, we need to create a set of features for the development set. Magellan provides a way to automatically generate features based on the attributes in the input tables. For the purposes of this guide, we use the automatically generated features.


In [8]:
# Generate a set of features
F = em.get_features_for_matching(A, B, validate_inferred_attr_types=False)

We observe that 20 features were generated. Let us first examine the generated features; for the purposes of this guide, we use all of them.


In [9]:
F.feature_name


Out[9]:
0                          id_id_lev_dist
1                           id_id_lev_sim
2                               id_id_jar
3                               id_id_jwn
4                               id_id_exm
5                   id_id_jac_qgm_3_qgm_3
6             title_title_jac_qgm_3_qgm_3
7         title_title_cos_dlm_dc0_dlm_dc0
8                         title_title_mel
9                    title_title_lev_dist
10                    title_title_lev_sim
11        authors_authors_jac_qgm_3_qgm_3
12    authors_authors_cos_dlm_dc0_dlm_dc0
13                    authors_authors_mel
14               authors_authors_lev_dist
15                authors_authors_lev_sim
16                          year_year_exm
17                          year_year_anm
18                     year_year_lev_dist
19                      year_year_lev_sim
Name: feature_name, dtype: object
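
Since F is a pandas DataFrame, a subset of features could also be selected before extraction. A minimal sketch (hypothetical; this guide keeps all 20 features):

# Sketch: restrict F to the 'year' related features (not used below)
year_feats = F[F.feature_name.str.startswith('year')]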

Extracting feature vectors

In this step, we extract feature vectors using the development set and the created features.


In [10]:
# Convert I into a set of feature vectors using F
H = em.extract_feature_vecs(I, 
                            feature_table=F, 
                            attrs_after='label',
                            show_progress=False)

In [11]:
# Display first few rows
H.head()


Out[11]:
_id ltable_id rtable_id id_id_lev_dist id_id_lev_sim id_id_jar id_id_jwn id_id_exm id_id_jac_qgm_3_qgm_3 title_title_jac_qgm_3_qgm_3 ... authors_authors_jac_qgm_3_qgm_3 authors_authors_cos_dlm_dc0_dlm_dc0 authors_authors_mel authors_authors_lev_dist authors_authors_lev_sim year_year_exm year_year_anm year_year_lev_dist year_year_lev_sim label
430 430 l1494 r1257 4 0.20 0.466667 0.466667 0 0.000000 0.000000 ... 0.000000 0.000000 0.445707 44.0 0.083333 1 1.0 0.0 1.0 0
35 35 l1385 r1160 4 0.20 0.466667 0.466667 0 0.000000 0.025641 ... 0.000000 0.000000 0.589417 43.0 0.271186 1 1.0 0.0 1.0 0
394 394 l1345 r85 4 0.20 0.000000 0.000000 0 0.090909 1.000000 ... 0.951111 0.945946 0.822080 172.0 0.338462 1 1.0 0.0 1.0 1
29 29 l611 r141 3 0.25 0.666667 0.666667 0 0.090909 0.049383 ... 0.000000 0.000000 0.531543 26.0 0.277778 1 1.0 0.0 1.0 0
181 181 l1164 r1161 2 0.60 0.733333 0.733333 0 0.076923 1.000000 ... 0.592593 0.668153 0.684700 34.0 0.244444 1 1.0 0.0 1.0 1

5 rows × 24 columns


In [12]:
# Check if the feature vectors contain missing values
# A return value of True means that there are missing values
pd.isnull(H).values.any()


Out[12]:
True

We observe that the extracted feature vectors contain missing values. We have to impute the missing values for the learning-based matchers to fit their models correctly. For the purposes of this guide, we impute the missing values in a column with the mean of the values in that column.


In [13]:
# Impute feature vectors with the mean of the column values.
H = em.impute_table(H, 
                exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],
                strategy='mean')
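
We can re-run the missing-value check to confirm the imputation; it should now return False:

# Verify that no missing values remain after imputation
pd.isnull(H).values.any()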

Selecting the best matcher using cross-validation

Now, we select the best matcher using k-fold cross-validation. For the purposes of this guide, we use five-fold cross-validation and the 'f1' metric to select the best matcher.


In [14]:
# Select the best ML matcher using CV
result = em.select_matcher([dt, rf, svm, ln, lg], table=H, 
        exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],
        k=5,
        target_attr='label', metric_to_select_matcher='f1', random_state=0)
result['cv_stats']


Out[14]:
Matcher Average precision Average recall Average f1
0 DecisionTree 0.915322 0.950714 0.930980
1 RF 1.000000 0.950714 0.974131
2 SVM 0.977778 0.810632 0.883248
3 LinReg 1.000000 0.935330 0.966131
4 LogReg 0.985714 0.935330 0.958724

In [15]:
result['drill_down_cv_stats']['precision']


Out[15]:
Name Matcher Num folds Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean score
0 DecisionTree <py_entitymatching.matcher.dtmatcher.DTMatcher object at 0x10db02990> 5 0.95 1.000000 0.764706 0.933333 0.928571 0.915322
1 RF <py_entitymatching.matcher.rfmatcher.RFMatcher object at 0x10db02310> 5 1.00 1.000000 1.000000 1.000000 1.000000 1.000000
2 SVM <py_entitymatching.matcher.svmmatcher.SVMMatcher object at 0x10db02390> 5 1.00 1.000000 0.888889 1.000000 1.000000 0.977778
3 LinReg <py_entitymatching.matcher.linregmatcher.LinRegMatcher object at 0x10db020d0> 5 1.00 1.000000 1.000000 1.000000 1.000000 1.000000
4 LogReg <py_entitymatching.matcher.logregmatcher.LogRegMatcher object at 0x10db02210> 5 1.00 0.928571 1.000000 1.000000 1.000000 0.985714

In [16]:
result['drill_down_cv_stats']['recall']


Out[16]:
Name Matcher Num folds Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean score
0 DecisionTree <py_entitymatching.matcher.dtmatcher.DTMatcher object at 0x10db02990> 5 0.95 1.000000 0.928571 0.8750 1.000000 0.950714
1 RF <py_entitymatching.matcher.rfmatcher.RFMatcher object at 0x10db02310> 5 0.95 1.000000 0.928571 0.8750 1.000000 0.950714
2 SVM <py_entitymatching.matcher.svmmatcher.SVMMatcher object at 0x10db02390> 5 0.90 0.923077 0.571429 0.8125 0.846154 0.810632
3 LinReg <py_entitymatching.matcher.linregmatcher.LinRegMatcher object at 0x10db020d0> 5 0.95 1.000000 0.928571 0.8750 0.923077 0.935330
4 LogReg <py_entitymatching.matcher.logregmatcher.LogRegMatcher object at 0x10db02210> 5 0.95 1.000000 0.928571 0.8750 0.923077 0.935330

In [17]:
result['drill_down_cv_stats']['f1']


Out[17]:
Name Matcher Num folds Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean score
0 DecisionTree <py_entitymatching.matcher.dtmatcher.DTMatcher object at 0x10db02990> 5 0.950000 1.000000 0.838710 0.903226 0.962963 0.930980
1 RF <py_entitymatching.matcher.rfmatcher.RFMatcher object at 0x10db02310> 5 0.974359 1.000000 0.962963 0.933333 1.000000 0.974131
2 SVM <py_entitymatching.matcher.svmmatcher.SVMMatcher object at 0x10db02390> 5 0.947368 0.960000 0.695652 0.896552 0.916667 0.883248
3 LinReg <py_entitymatching.matcher.linregmatcher.LinRegMatcher object at 0x10db020d0> 5 0.974359 1.000000 0.962963 0.933333 0.960000 0.966131
4 LogReg <py_entitymatching.matcher.logregmatcher.LogRegMatcher object at 0x10db02210> 5 0.974359 0.962963 0.962963 0.933333 0.960000 0.958724

Debugging the matcher (Random Forest)


In [18]:
# Split H into P and Q
PQ = em.split_train_test(H, train_proportion=0.5, random_state=0)
P = PQ['train']
Q = PQ['test']

In [19]:
# Debug RF matcher using GUI
em.vis_debug_rf(rf, P, Q, 
        exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],
        target_attr='label')

In [20]:
# Add a feature that computes Jaccard similarity over title + authors to F

# Create a feature declaratively
sim = em.get_sim_funs_for_matching()
tok = em.get_tokenizers_for_matching()
feature_string = """jaccard(wspace((ltuple['title'] + ' ' + ltuple['authors']).lower()), 
                            wspace((rtuple['title'] + ' ' + rtuple['authors']).lower()))"""
feature = em.get_feature_fn(feature_string, sim, tok)

# Add feature to F
em.add_feature(F, 'jac_ws_title_authors', feature)


Out[20]:
True
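
em.get_feature_fn returns a dictionary whose 'function' entry holds the underlying callable, so the new feature can be sanity-checked on a single tuple pair. A minimal sketch, assuming positional indexing into A and B:

# Sketch: apply the new feature to the first tuple of each table
feature['function'](A.iloc[0], B.iloc[0])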

In [21]:
# Convert I into feature vectors using the updated F
H = em.extract_feature_vecs(I, 
                            feature_table=F, 
                            attrs_after='label',
                            show_progress=False)

In [22]:
# Check whether the updated F improves the Random Forest matcher
result = em.select_matcher([rf], table=H, 
        exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],
        k=5,
        target_attr='label', metric_to_select_matcher='f1', random_state=0)
result['drill_down_cv_stats']['f1']


Out[22]:
Name Matcher Num folds Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean score
0 RF <py_entitymatching.matcher.rfmatcher.RFMatcher object at 0x10db02310> 5 0.974359 1.0 0.962963 0.933333 1.0 0.974131

In [23]:
# Select the best matcher again using CV
result = em.select_matcher([dt, rf, svm, ln, lg], table=H, 
        exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],
        k=5,
        target_attr='label', metric_to_select_matcher='f1', random_state=0)
result['cv_stats']


Out[23]:
Matcher Average precision Average recall Average f1
0 DecisionTree 1.000000 1.000000 1.000000
1 RF 1.000000 0.950714 0.974131
2 SVM 1.000000 0.837418 0.907995
3 LinReg 1.000000 0.970330 0.984593
4 LogReg 0.985714 0.935330 0.958724

In [24]:
result['drill_down_cv_stats']['f1']


Out[24]:
Name Matcher Num folds Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean score
0 DecisionTree <py_entitymatching.matcher.dtmatcher.DTMatcher object at 0x10db02990> 5 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
1 RF <py_entitymatching.matcher.rfmatcher.RFMatcher object at 0x10db02310> 5 0.974359 1.000000 0.962963 0.933333 1.000000 0.974131
2 SVM <py_entitymatching.matcher.svmmatcher.SVMMatcher object at 0x10db02390> 5 0.947368 0.960000 0.782609 0.933333 0.916667 0.907995
3 LinReg <py_entitymatching.matcher.linregmatcher.LinRegMatcher object at 0x10db020d0> 5 1.000000 1.000000 0.962963 1.000000 0.960000 0.984593
4 LogReg <py_entitymatching.matcher.logregmatcher.LogRegMatcher object at 0x10db02210> 5 0.974359 0.962963 0.962963 0.933333 0.960000 0.958724
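
Having selected the best matcher (here, the decision tree after adding the new feature), the natural next step is to train it on the development set and evaluate it on J. A minimal sketch, not executed in this notebook, mirroring the steps used for I; the full workflow is covered in the guide on evaluating the selected matcher:

# Sketch: train the selected matcher on H
dt.fit(table=H,
       exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],
       target_attr='label')

# Convert J into feature vectors and impute, mirroring the steps for I
L = em.extract_feature_vecs(J, feature_table=F,
                            attrs_after='label', show_progress=False)
L = em.impute_table(L,
                    exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],
                    strategy='mean')

# Predict matches on the evaluation set and report accuracy metrics
predictions = dt.predict(table=L,
                         exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],
                         target_attr='predicted', append=True, inplace=False)
eval_result = em.eval_matches(predictions, 'label', 'predicted')
em.print_eval_summary(eval_result)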